Introduction

Are we alone in the Universe? This is one of the most profound questions that humankind has sought to answer since the beginning of recorded history. We can gain some insight into this mystery with the modern search for exoplanets. The underlying purpose of contemporary exoplanet programs is to discover habitable planets, especially around nearby stars, and find evidence of life elsewhere in the cosmos. Using Earth and its lifeforms as a template to assess habitability, we seek other celestial bodies with conditions similar to our own, i.e., with Earth-like surface gravities and temperatures where liquid water could exist. In order to meet these criteria, most habitable worlds must be a particular mass and a particular distance from their host star. We are interested in analyzing the data for confirmed exoplanets to asses our progress in finding a planet with similar physical characteristics to Earth.

For our investigation, we used data from the NASA Exoplanet Archive, which can be found at the following URL https://exoplanetarchive.ipac.caltech.edu. This archive is an online compilation, collation, and cross-correlation of astronomical data and information on exoplanets and their host stars. The data are vetted by a team of astronomers at the California Institute of Technology (Caltech) under contract with the National Aeronautics and Space Administration (NASA) under the Exoplanet Exploration Program. An extensive overview of the data, services and tools of the archive can be found in a published paper by Akeson, et al. (2013, PASP, 125, 989) in the Publications of the Astronomical Society of the Pacific (PASP). This publication can be found here.

We downloaded our dataset from the NASA Exoplanet Archive on 2018 July 5. The data consists of 354 columns and 3,748 rows of information on confirmed exoplanets and their host stars as well as information about their discovery. Discovery information includes the method used to detect the exoplanet, the locale of the observatories used for detection (i.e., whether ground-based, space-based, or a mixture of both observations were used for detection), and the year of discovery. Physical characteristics of exoplanets present in the archive include planetary mass, orbital period, and orbital semi-major axis. Physical properties of host stars listed in the archive include stellar mass, stellar radius, effective temperature, surface gravity, spectral type, luminosity, and distance from Earth. These physical properties for both exoplanets and their host stars are important for determining the similarity between an exoplanet and Earth, and their detection sensitivity.

We trimmed and removed space characters from the original dataset from the NASA Exoplanet Archive to the 15 columns that are most relevant and directly related to habitable planets. These data are stored in the file planets.csv. The variable for which we are most interested as a response for this study is pl_orbsmax, the planet orbital semi-major axis in astronomical units (\(\rm{AU} = 1.495978707 \times 10^{11}\,\rm{m}\)). This response variable is a key physical parameter that determines the habitability of a planet.

Methods

Data preprocessing

To ensure we had a consistent dataset with no empty values for any responses or predictors for which we were considering, we filtered out rows with any empty values and were left with 594 rows.

planets <- read.csv("planets.csv")
planets_good <- planets[complete.cases(planets), ]

The structure of the R object for which we will perform our analysis is as follows.

str(planets_good)
## 'data.frame':    594 obs. of  15 variables:
##  $ pl_discmethod: Factor w/ 10 levels "Astrometry","EclipseTimingVariations",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ pl_disc      : int  2007 2009 2008 2002 1996 2018 2010 2010 2009 2008 ...
##  $ pl_locale    : Factor w/ 3 levels "Ground","MultipleLocales",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ st_dist      : num  110.6 119.5 76.4 18.1 21.4 ...
##  $ st_optmag    : num  4.74 5.02 5.23 6.61 6.25 ...
##  $ st_teff      : num  4742 4213 4813 5338 5750 ...
##  $ st_mass      : num  2.7 2.78 2.2 0.9 1.08 0.99 1.54 1.54 1.93 0.98 ...
##  $ st_rad       : num  19 29.79 11 0.93 1.13 ...
##  $ st_logg      : num  2.31 1.93 2.63 4.45 4.36 2.42 3.5 3.5 4.43 1.71 ...
##  $ st_metfe     : num  -0.35 -0.02 -0.24 0.41 0.06 -0.77 -0.03 -0.03 0.19 -0.46 ...
##  $ st_vsini     : num  1.2 1.5 2.6 1.6 2.18 3.36 2.77 2.77 38.3 1 ...
##  $ pl_orbper    : num  326 516 186 1773 798 ...
##  $ pl_orbsmax   : num  1.29 1.53 0.83 2.93 1.66 ...
##  $ pl_orbeccen  : num  0.231 0.08 0 0.37 0.68 0.042 0.09 0.29 0.29 0.38 ...
##  $ pl_bmasse    : num  6166 4685 1526 1481 566 ...

There are two factor variables and 13 numeric variables. The following table briefly describes the variables.

Variable name Variable description
pl_discmethod Method of discovery (RadialVelocity, Transit, TransitTimingVariations, etc.)
pl_disc Year of discovery
pl_locale Locale of discovery (Ground, MultipleLocales or Space)
st_dist Stellar distance (parsecs)
st_optmag Stellar apparent magnitude (mag)
st_teff Stellar effective temperature (Kelvin)
st_mass Stellar mass (solar masses)
st_rad Stellar radius (solar radii)
st_logg Stellar surface gravity (log g)
st_metfe Stellar metallicity (log(Fe/H or M/H))
st_vsini Stellar projected rotational velocity (m/s)
pl_orbper Planet orbital period (days)
pl_orbsmax Planet orbital semi-major axis (AU)
pl_orbeccen Planet orbital eccentricity
pl_bmasse Planet mass (Earth masses)

Some of these variables may be collinear to each other. We can inspect for collinearity visually using a matrix of scatterplots of the variables.

From the matrix of scatterplots, it appears that planet orbital period (pl_orbper) and orbital semi-major axis (pl_orbsmax) are collinear. This collinearity is not unexpected since orbital period and semi-major axis are related to each other physically as supported in Kepler’s third law of planetary motion. It also appears that stellar radius (st_rad) and stellar surface gravity (st_logg) are collinear as well. Collinearity makes estimating model coefficients and interpreting models more difficult, but it does not affect model predictions. We will keep this in mind as we search for potential models for our data.

Preliminary model diagnostics

Assumption of homoscedasticity

We will check the constant variance assumption for our latest model using a residual versus fitted values plot, and using a Breusch-Pagan test.

By visual inspection, the general span of residuals is not very constant across the range of fitted values. This suggests that homoscedasticity or constant variance is suspect.

Next, we will perform the Breusch-Pagan test on the model.

library(lmtest)
log_pl_orbsmax_mod_back_bic_bp <- bptest(log_pl_orbsmax_mod_back_bic)

The p-value from the Breusch-Pagan test for the model is \(1.6911\times 10^{-25}\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(1.6911\times 10^{-25} \ll \alpha\). This further suggests that homoscedasticity is suspect.

Assumption of normality

We will check the normality assumption for our model using a Q-Q plot, and using a Shapiro-Wilk test.

By visual inspection, the data seems to deviate slightly from normality in the lower part of the Q-Q plot. The normality assumption may be suspect.

Next, we will perform the Shapiro-Wilk test on the model.

log_pl_orbsmax_mod_back_bic_sw <- shapiro.test(resid(log_pl_orbsmax_mod_back_bic))

The p-value from the Shapiro-Wilk test for the model is \(0.0024\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.0024 \lt \alpha\). This further suggests that normality is suspect.

Results

The R summary for the model we have chosen to examine the orbital semi-major axis of exoplanets is given below.

summary(log_pl_orbsmax_mod_exh_back_bic)
## 
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_locale + st_optmag + st_vsini + 
##     log(pl_bmasse), data = planets_good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.832 -0.826 -0.003  0.832  3.166 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.32730    0.25613    1.28      0.2    
## pl_localeSpace  1.11819    0.17278    6.47  2.0e-10 ***
## st_optmag      -0.38017    0.02037  -18.67  < 2e-16 ***
## st_vsini       -0.04028    0.00611   -6.60  9.4e-11 ***
## log(pl_bmasse)  0.38803    0.02919   13.29  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.18 on 589 degrees of freedom
## Multiple R-squared:  0.534,  Adjusted R-squared:  0.53 
## F-statistic:  168 on 4 and 589 DF,  p-value: <2e-16

Model diagnostics

Assumption of homoscedasticity

We will check the constant variance assumption for our final model using a residual versus fitted values plot, and using a Breusch-Pagan test.

By visual inspection, the general span of residuals is mostly constant across the range of fitted values although many of the datapoints are clustered in two areas.

Next, we will perform the Breusch-Pagan test on the model.

library(lmtest)
log_pl_orbsmax_mod_exh_back_bic_bp <- bptest(log_pl_orbsmax_mod_exh_back_bic)

The p-value from the Breusch-Pagan test for the model is \(0.3016\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.3016 \gt \alpha\). This suggests that homoscedasticity is not suspect.

Assumption of normality

We will check the normality assumption for our model using a Q-Q plot, and using a Shapiro-Wilk test.

By visual inspection, the data seems to consistent with normality.

Next, we will perform the Shapiro-Wilk test on the model.

log_pl_orbsmax_mod_exh_back_bic_sw <- shapiro.test(resid(log_pl_orbsmax_mod_exh_back_bic))

The p-value from the Shapiro-Wilk test for the model is \(0.3725\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.3725 \gt \alpha\). This suggests that normality is not suspect.

Outlier diagnostics

We will see how influential points affect the coefficients. Here, we define influential points as those with a Cook’s distance larger than four divided by the number of points. We will remove the influential points and refit the model.

cd_log_pl_orbsmax_mod_exh_back_bic = cooks.distance(log_pl_orbsmax_mod_exh_back_bic)
log_pl_orbsmax_mod_exh_back_bic_fix = lm(log_pl_orbsmax_mod_exh_back_bic, data = planets_good, subset = cd_log_pl_orbsmax_mod_exh_back_bic <= 4 / length(cd_log_pl_orbsmax_mod_exh_back_bic))
summary(log_pl_orbsmax_mod_exh_back_bic_fix)
## 
## Call:
## lm(formula = log_pl_orbsmax_mod_exh_back_bic, data = planets_good, 
##     subset = cd_log_pl_orbsmax_mod_exh_back_bic <= 4/length(cd_log_pl_orbsmax_mod_exh_back_bic))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0477 -0.7670  0.0016  0.7817  2.6925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.6205     0.2512    2.47    0.014 *  
## pl_localeSpace   1.1508     0.1708    6.74  3.9e-11 ***
## st_optmag       -0.4124     0.0197  -20.93  < 2e-16 ***
## st_vsini        -0.0565     0.0108   -5.22  2.5e-07 ***
## log(pl_bmasse)   0.3971     0.0286   13.86  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.1 on 566 degrees of freedom
## Multiple R-squared:  0.591,  Adjusted R-squared:  0.588 
## F-statistic:  204 on 4 and 566 DF,  p-value: <2e-16

Overall, the coefficients of the model fit with the influential points removed is similar to those of the model fit with the influential points included. The adjusted \(R^2\) for the model without influential points is somewhat higher with a value of 0.588.

Doing a Breusch-Pagan test on the model fit without influential points, we see that the p-value from the test is \(0.0037\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.0037 \lt \alpha\). This suggests that homoscedasticity is suspect. Therefore, we will continue to use the model derived from data with the influential points.

Collinearity diagnostics

Next, we will look at the variance inflation factor (VIF) to examine collinearity in our model variables. The following table lists the predictors in our model and its corresponding VIF.

Variable name VIF
pl_locale 1.597
st_optmag 1.539
st_vsini 1.035
log(pl_bmasse) 1.104

None of the model variables has a VIF greater than five, so we do not suspect any significant collinearity in our model.

Model comparison

Model Adjusted \(R^2\) LOOCV RMSE Breusch-Pagan test decision Shapiro-Wilks test decision # of Predictors
Additive Model 0.3031 1.259 Reject Reject 14
Backward BIC Model 0.2957 1.252 Reject Reject 7
Logarithmic Additive Model 0.5940 1.117 Reject Reject 14
Logarithmic Backward BIC Model 0.5892 1.112 Reject Reject 9
Logarithmic Exhaustive Model 0.5387 1.177 Fail to Reject Fail to Reject 8
Logarithmic Exhaustive Backward BIC Model 0.5304 1.184 Fail to Reject Fail to Reject 4
Logarithmic Exhaustive Backward BIC Model without Influential Points 0.5880 1.102 Reject Fail to Reject 4

We prefer models where assumptions of homoscedasticity and normality are not suspect. While the logarithmic additive and logarithmic backward BIC models had larger adjusted \(R^2\) values and lower LOOCV-RMSE values, the Breusch-Pagan and Shapiro-Wilks tests results would reject the hypothesis of constant variance and normality, respectively. The logarithmic exhaustive backward BIC model has comparable adjusted \(R^2\) and LOOCV-RMSE values and with only four predictors compared to eight predictors for its counterpart without applying a backward BIC search. Finally, we would choose the logarithmic exhaustive backward BIC model with influential points over its counterpart without influential points because the assumption of constant variance is suspect for the latter.

Discussion

The R summary for the model we have chosen to examine the orbital semi-major axis of exoplanets is given below.

summary(log_pl_orbsmax_mod_exh_back_bic)
## 
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_locale + st_optmag + st_vsini + 
##     log(pl_bmasse), data = planets_good)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.832 -0.826 -0.003  0.832  3.166 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.32730    0.25613    1.28      0.2    
## pl_localeSpace  1.11819    0.17278    6.47  2.0e-10 ***
## st_optmag      -0.38017    0.02037  -18.67  < 2e-16 ***
## st_vsini       -0.04028    0.00611   -6.60  9.4e-11 ***
## log(pl_bmasse)  0.38803    0.02919   13.29  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.18 on 589 degrees of freedom
## Multiple R-squared:  0.534,  Adjusted R-squared:  0.53 
## F-statistic:  168 on 4 and 589 DF,  p-value: <2e-16

The model we have chosen to predict the (logarithm of) planet orbital semi-major axis is an additive model with four predictors which are the locale of discovery, the stellar apparent magnitude, the stellar projected rotational velocity, and the logarithm of planet mass.

The coefficient for the locale of discovery is 1.1182. This means that exoplanets discovered using space-based observations are generally found to be further from their host stars than those discovered using ground-based observations by a factor of 3.0593. Most space-based exoplanet discoveries were made using the Kepler space observatory (https://exoplanetarchive.ipac.caltech.edu/docs/counts_detail.html). Kepler uses the transit search method to find exoplanets which is theoretically more sensitive to planets with wide orbits than is the radial velocity method which is the other major method for planet detection.

The coefficient for stellar apparent magnitude is -0.3802. This indicates that exoplanets with tighter orbits tend to be found around stars with higher apparent magnitude, and vice versa, i.e., exoplanets with wider orbits tend to be found around stars with lower apparent magnitude. The apparent magnitude of a star is a measure of its brightness as seen by an observer on Earth. Stellar magnitude is an inverse relation, i.e., the brighter an object appears, the lower its magnitude value and vice versa. Therefore, exoplanets with small orbits are more often found around faint stars as observed from Earth. This is not unexpected since the signal from faint stars has higher noise than from bright stars, and planets close-in to their host star are intrinsically more likely to be detected by the two most common techniques of planet detection, radial velocity method and transit search, than those farther from their host star.

The coefficient for stellar projected rotational velocity is -0.0403. For every unit increase in projected rotational velocity, there is a slight decrease in the orbital semi-major axis of a confirmed exoplanet. In other words, one is more likely to find an exoplanet in a tight orbit around a rapidly rotating star, and in a wide orbit around a slowly rotating star. This effect could come from observational bias. The radial velocity method for finding exoplanets relies on tracking spectral features from starlight, and stellar rotation broadens these features and making them less trackable. As mentioned previously, the radial velocity method is also more sensitive to planets in tighter orbits, so the radial velocity indicators from planets orbiting tightly around a rapidly rotating star are more likely to accurately observed than wider orbiting counterparts.

The coefficient of (the logarithm of) planet mass is 0.388. This means that more massive planets tend to be found orbiting farther from their host star than their less massive counterparts. This planet mass-orbit relation scales as a power law with an exponent of 0.388 (the corresonding coefficient in our model). This result could be physical, observational, or both. The massive gas giant planets in our solar system have wider orbits than the small terrestrial planets. Massive planets are more detectable than less massive planets, and the two most common methods for finding exoplanets, radial velocity method and transit search, are more sensitive to planets in closer orbits.

Our model is useful by highlighting the significant difference in planet discoveries made from space and those made on the ground. Our model can also be used to estimate the detection limits of finding planets around faint stars and rapidly rotating stars. Ideally, we would like to discover an Earth-mass exoplanet orbiting its host star at a distance where liquid water could exist and support life. The planet mass-orbit relation derived from our analysis of confirmed exoplanets could help to tell us how close we are or how much improvement we must make in our planet searches in order to discovery another Earth.

Appendix

The following code lists helper R functions that calculate or retrieve statistical values and report statistical decisions.

get_adj_r2 <- function(model) {
  summary(model)$adj.r.squared
}

get_loocv_rmse <- function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_num_params <- function(model) {
  length(coef(model))
}

get_bp_decision <- function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision <- function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

The following code is the multi-nested loop used to generate strings that were parsed and evaluated by R to fit a collection models in the exhaustive model search.

lm_str_head <- "cur_mod=lm(log(pl_orbsmax)~"
lm_str_tail <- ",data=planets_good)"
lm_str <- rep("", n)
i <- 1
for (ipl_discmethod in c("1", "pl_discmethod")) {
  for (ipl_disc in c("1", "pl_disc")) {
    for (ipl_locale in c("1", "pl_locale")) {
      for (ist_dist in c("1", "st_dist", "log(st_dist)")) {
        for (ist_optmag in c("1", "st_optmag", "exp(st_optmag)")) {
          for (ist_teff in c("1", "st_teff", "log(st_teff)")) {
            for (ist_mass in c("1", "st_mass", "log(st_mass)")) {
              for (ist_rad in c("1", "st_rad", "log(st_rad)")) {
                for (ist_logg in c("1", "st_logg", "exp(st_logg)")) {
                  for (ist_metfe in c("1", "st_metfe", "exp(st_metfe)")) {
                    for (ist_vsini in c("1", "st_vsini")) {
                      for (ipl_orbeccen in c("1", "pl_orbeccen")) {
                        for (ipl_bmasse in c("1", "pl_bmasse", "log(pl_bmasse)")) {
                          lm_str[i] <- paste(ipl_discmethod,
                                             ipl_disc,
                                             ipl_locale,
                                             ist_dist,
                                             ist_optmag,
                                             ist_teff,
                                             ist_mass,
                                             ist_rad,
                                             ist_logg,
                                             ist_metfe,
                                             ist_vsini,
                                             ipl_orbeccen,
                                             ipl_bmasse,
                                             sep = "+")
                          i <- i + 1
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}